SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts

Authors

  • Sebastian Baltes
  • Lorik Dumani
  • Christoph Treude
  • Stephan Diehl
Abstract

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on a first analysis using the dataset, we present insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text. Further, our analysis revealed a close relationship between post edits and comments. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.
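The abstract mentions reconstructing the version history of individual text and code blocks with string similarity metrics. As a rough illustration of that idea only, the following sketch links the blocks of two consecutive post versions. It is not the SOTorrent implementation: the similarity function (Python's difflib ratio), the 0.6 threshold, the greedy matching strategy, and the names similarity and match_blocks are all assumptions chosen for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]. difflib's ratio is an illustrative
    stand-in, not one of the metrics evaluated in the paper."""
    return SequenceMatcher(None, a, b).ratio()


def match_blocks(old_blocks, new_blocks, threshold=0.6):
    """Greedily link each block of the new version to its most similar,
    still unused predecessor of the same type ("text" or "code").

    Blocks are (type, content) tuples. The threshold is an arbitrary
    value picked for this sketch.
    """
    links = {}    # index in new_blocks -> index in old_blocks
    used = set()  # old blocks that already have a successor
    for j, (new_type, new_text) in enumerate(new_blocks):
        best_i, best_score = None, 0.0
        for i, (old_type, old_text) in enumerate(old_blocks):
            if i in used or old_type != new_type:
                continue
            score = similarity(old_text, new_text)
            if score > best_score:
                best_i, best_score = i, score
        if best_i is not None and best_score >= threshold:
            links[j] = best_i
            used.add(best_i)
    return links


# Two hypothetical versions of the same post.
v1 = [
    ("text", "Use a loop to sum the values."),
    ("code", "total = 0\nfor x in xs:\n    total += x"),
]
v2 = [
    ("text", "Use a loop to sum the values in the list."),
    ("code", "total = 0\nfor x in xs:\n    total += x"),
    ("code", "import os"),
]

links = match_blocks(v1, v2)
print(links)  # {0: 0, 1: 1}
```

In this example the edited text block and the unchanged code block are linked to their predecessors, while the new third block has no predecessor and would start its own history.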

Similar resources

Are Code Examples on an Online Q&A Forum Reliable?

Programmers often consult an online Q&A forum such as Stack Overflow to learn new APIs. This paper presents an empirical study on the prevalence and severity of API misuse on Stack Overflow. To reduce manual assessment effort, we design Maple, an API usage mining approach that extracts patterns from over 380K Java repositories on GitHub and subsequently reports potential API usage violations in...

A Manual Categorization of Android App Development Issues Using Stack Overflow Posts

The discussion of issues related to the development of mobile applications (apps) has become increasingly popular on Q&A platforms such as Stack Overflow. Barua et al. [1] stated that Android is among the topics with the largest increase in the number of posts on Stack Overflow. The success of a mobile application depends on the quality of the application. Linares-Vásquez et al. [2] found...

GitHub and Stack Overflow: Analyzing Developer Interests Across Multiple Social Collaborative Platforms

Increasingly, software developers are using a wide array of social collaborative platforms for software development and learning. In this work, we examined the similarities in developers' interests within and across GitHub and Stack Overflow. Our study finds that developers share common interests in GitHub and Stack Overflow; on average, 39% of the GitHub repositories and Stack Overflow questio...

CASE-QA: Context and Syntax embeddings for Question Answering On Stack Overflow

Question answering (QA) systems rely on both knowledge bases and unstructured text corpora. Domain-specific QA presents a unique challenge, since relevant knowledge bases are often lacking and unstructured text is difficult to query and parse. This project focuses on the QUASAR-S dataset (Dhingra et al., 2017) constructed from the community QA site Stack Overflow. QUASAR-S consists of Cloze-sty...

Detecting Duplicate Posts in Programming QA Communities via Latent Semantics and Association Rules

Programming community-based question-answering (PCQA) websites such as Stack Overflow enable programmers to find working solutions to their questions. Despite detailed posting guidelines, duplicate questions that have been answered are frequently created. To tackle this problem, Stack Overflow provides a mechanism for reputable users to manually mark duplicate questions. This is a laborious eff...

Publication date: 2018